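Colly is a web crawling tool for Golang. A web crawler, in short, is a tool that automatically collects and parses data on the web.
In this chapter we will introduce how to use Colly to collect data from a specific domain and website!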
go get -u github.com/gocolly/colly
app/crawler/collier.go
package crawler

import (
	"github.com/gocolly/colly"
	"github.com/sirupsen/logrus"
	"ironman-2021/app/middleware"
)

// Collier visits url and writes every step of the crawl to the log.
func Collier(url string) {
	var body string
	c := colly.NewCollector(
		colly.UserAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)"),
	)
	// Called before each request is sent.
	c.OnRequest(func(r *colly.Request) {
		middleware.Logger().WithFields(logrus.Fields{
			"name": "Collier",
		}).Info("Visiting", r.URL)
	})
	// Called when the request fails.
	c.OnError(func(_ *colly.Response, err error) {
		middleware.Logger().WithFields(logrus.Fields{
			"name": "Collier",
		}).Info("Visiting Failed, err: ", err)
	})
	// Called when a response arrives; the page body is written to the log.
	c.OnResponse(func(r *colly.Response) {
		body = string(r.Body)
		middleware.Logger().WithFields(logrus.Fields{
			"name": "Collier",
		}).Info("Visited, body: ", body)
	})
	// Called after the page has been processed.
	c.OnScraped(func(r *colly.Response) {
		middleware.Logger().WithFields(logrus.Fields{
			"name": "Collier",
		}).Info("Finished", r.Request.URL)
	})
	err := c.Visit(url)
	if err != nil {
		return
	}
}
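Here we create an instance of Collector named c. Apart from OnResponse, where we write the crawl result to the log, every other callback simply writes its execution step to the log.
main.go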
server.GET("/crawler", func(c *gin.Context) {
	crawler.Collier("https://ithelp.ithome.com.tw/users/20129737/ironman/4014")
	c.String(http.StatusOK, "Finished Collier")
})
Finally, we add a simple GET API to the main program to trigger the crawl.
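The handler above always answers with 200, because Collier does not hand the error from c.Visit back to its caller. One possible variation is sketched below using only the standard library's net/http rather than Gin. Everything in it is an assumption made for illustration: crawlFunc, crawlerHandler, statusFor, okCrawl, and failingCrawl are hypothetical names, and the 502 status is just one reasonable choice.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// crawlFunc is a hypothetical stand-in for a Collier variant that returns its error.
type crawlFunc func(url string) error

// crawlerHandler triggers the crawl and maps its result to an HTTP status.
func crawlerHandler(crawl crawlFunc, target string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if err := crawl(target); err != nil {
			http.Error(w, "crawl failed: "+err.Error(), http.StatusBadGateway)
			return
		}
		fmt.Fprint(w, "Finished Collier")
	}
}

// okCrawl and failingCrawl simulate a successful and a failed crawl.
func okCrawl(string) error      { return nil }
func failingCrawl(string) error { return errors.New("timeout") }

// statusFor runs the handler once against an in-memory recorder
// and returns the status code it produced.
func statusFor(crawl crawlFunc) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/crawler", nil)
	crawlerHandler(crawl, "https://example.com")(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(okCrawl), statusFor(failingCrawl)) // prints: 200 502
}
```

With this shape, a failed crawl becomes visible to whoever calls the API instead of only appearing in the log.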
logs/2021-10-10.log
time="101010-10-10 1010:1010:1010" level=info msg="Health CheckInfo" name="Flynn Sun"
time="101010-10-10 1010:1010:1010" level=info msg="| 200 | 5.3626ms | 172.19.0.1 | GET | /hc |"
time="101010-10-10 1010:1010:1010" level=info msg="Visitinghttps://ithelp.ithome.com.tw/articles/10279931" name=Collier
time="101010-10-10 1010:1010:1010" level=info msg="Visited, body: <!DOCTYPE html>\n<html lang=\"zh-TW\">\n\n<head>\n <meta charset=\"utf-8\">\n<meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge\">\n<meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">\n\n\n<title>Day25 Gin with API Test - iT 邦幫忙::一起幫忙解決難題,拯救 IT 人的一天</title>\n\n<meta name=\"description\" content=\"What is API Test? 我們可以把它想成Unit Test單元測試的一種,不過它所涵蓋的最好集合不像以往的UnitTest可能以Function為主,而是Endpoint。 透過API T...\"/>\n<meta name=\"keywords\" content=\"iT邦幫忙,iThome\">\n<meta name=\"author\" content=\"iThome\">\n<meta property=\"og:site_name\" content=\"iT 邦幫忙::一起幫忙解決難題,拯救 IT 人的一天\"/>\n<meta property=\"og:url\" content=\"https://ithelp.ithome.com.tw/articles/10279931\"/>\n<meta property=\"og:type\" content=\"website\"/>\n<meta property=\"og:title\" content=\"Day25 Gin with API Test - iT 邦幫忙::一起幫忙解決難題,拯救 IT 人的一天\"/>\n<meta property=\"og:image\" content=\"https://ithelp.ithome.com.tw/upload/images/20211010/20129737oKVtf3CBHN.png\"/>\n<meta property=\"og:description\" content=\"What is API Test? 
我們可以把它想成Unit Test單元測試的一種,不過它所涵蓋的最好集合不像以往的UnitTest可能以Function為主,而是Endpoint。 透過API T...\"/>\n<meta property=\"fb:app_id\" content=\"137875859607921\" />\n\n<link rel=\"apple-touch-icon\" sizes=\"57x57\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-57x57.png\">\n<link rel=\"apple-touch-icon\" sizes=\"60x60\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-60x60.png\">\n<link rel=\"apple-touch-icon\" sizes=\"72x72\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-72x72.png\">\n<link rel=\"apple-touch-icon\" sizes=\"76x76\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-76x76.png\">\n<link rel=\"apple-touch-icon\" sizes=\"114x114\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-114x114.png\">\n<link rel=\"apple-touch-icon\" sizes=\"120x120\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-120x120.png\">\n<link rel=\"apple-touch-icon\" sizes=\"144x144\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-144x144.png\">\n<link rel=\"apple-touch-icon\" sizes=\"152x152\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-152x152.png\">\n<link rel=\"apple-touch-icon\" sizes=\"180x180\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-180x180.png\">\n<link rel=\"icon\" type=\"image/png\" href=\"https://ithelp.ithome.com.tw/storage/favicons/favicon-32x32.png\" sizes=\"32x32\">\n<link rel=\"icon\" type=\"image/png\"
...
href=\"https://ithelp.ithome.com.tw/storage/favicons/android-chrome-192x192.png\" sizes=\"192x192\">\n<link rel=\"icon\" type=\"image/png\" v>\n <div><a href=\"#\" class=\"invitation-list__account\">{{ result.account }}</a>\n </div>\n </div>\n </li>\n </ul>\n </div>\n <div class=\"modal-footer\">\n <a type=\"button\" class=\"btn btn-main\" data-dismiss=\"modal\">關閉</a>\n </div>\n </div>\n </div>\n </div>\n </body>\n\n</html>" name=Collier
time="101010-10-10 1010:1010:1010" level=info msg="Finishedhttps://ithelp.ithome.com.tw/articles/10279931" name=Collier
time="101010-10-10 1010:1010:1010" level=info msg="| 200 | 623.3349ms | 172.19.0.1 | GET | /crawler |"
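In the log we can finally see the record of our crawl!
Next, we will demonstrate a more difficult crawl!
First, let's look at the structure of the page we crawled above
(https://ithelp.ithome.com.tw/users/20129737/ironman/4014)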
<!DOCTYPE html>
<html lang="zh-TW">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>fmt.Println("從零開始的Golang生活") :: 2021 iThome 鐵人賽</title>
<meta name="description" content="講述一位Python Developer如何從零開始學習Go,並透過該角度進行解析。"/>
<meta name="keywords" content="iT邦幫忙,iThome">
<meta name="author" content="iThome">
<meta property="og:site_name" content="iT 邦幫忙::一起幫忙解決難題,拯救 IT 人的一天"/>
<meta property="og:url" content="https://ithelp.ithome.com.tw/users/20129737/ironman/4014"/>
<meta property="og:type" content="website"/>
<meta property="og:title" content="fmt.Println("從零開始的Golang生活") :: 2021 iThome 鐵人賽"/>
<meta property="og:image" content="https://ithelp.ithome.com.tw/images/ironman/13th/fb.jpg"/>
<meta property="og:description" content="講述一位Python Developer如何從零開始學習Go,並透過該角度進行解析。"/>
<meta property="fb:app_id" content="137875859607921" />
<link rel="apple-touch-icon" sizes="57x57" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-57x57.png">
<link rel="apple-touch-icon" sizes="60x60" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-60x60.png">
<link rel="apple-touch-icon" sizes="72x72" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-72x72.png">
<link rel="apple-touch-icon" sizes="76x76" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-76x76.png">
<link rel="apple-touch-icon" sizes="114x114" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-114x114.png">
<link rel="apple-touch-icon" sizes="120x120" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-120x120.png">
<link rel="apple-touch-icon" sizes="144x144" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-144x144.png">
<link rel="apple-touch-icon" sizes="152x152" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-152x152.png">
<link rel="apple-touch-icon" sizes="180x180" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-180x180.png">
<link rel="icon" type="image/png" href="https://ithelp.ithome.com.tw/storage/favicons/favicon-32x32.png" sizes="32x32">
<link rel="icon" type="image/png" href="https://ithelp.ithome.com.tw/storage/favicons/android-chrome-192x192.png" sizes="192x192">
<link rel="icon" type="image/png" href="https://ithelp.ithome.com.tw/storage/favicons/favicon-96x96.png" sizes="96x96">
<link rel="icon" type="image/png" href="https://ithelp.ithome.com.tw/storage/favicons/favicon-16x16.png" sizes="16x16">
<link rel="manifest" href="https://ithelp.ithome.com.tw/storage/favicons/manifest.json">
<link rel="mask-icon" href="https://ithelp.ithome.com.tw/storage/favicons/safari-pinned-tab.svg" color="#5bbad5">
<meta name="msapplication-TileColor" content="#da532c">
<meta name="msapplication-TileImage" content="https://ithelp.ithome.com.tw/storage/favicons/mstile-144x144.png">
<meta name="theme-color" content="#ffffff">
<link rel="stylesheet" href="https://ithelp.ithome.com.tw/css/bootstrap.min.css">
<link rel="stylesheet" href="//ajax.googleapis.com/ajax/libs/jqueryui/1.11.3/themes/smoothness/jquery-ui.css"/>
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.5.0/css/font-awesome.min.css">
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Lato:400,700">
<link rel="stylesheet" href="//cdn.jsdelivr.net/simplemde/latest/simplemde.min.css">
<link rel="stylesheet" href="https://ithelp.ithome.com.tw/css/sweetalert.css">
<link rel="stylesheet" href="https://ithelp.ithome.com.tw/lib/select2/css/select2.min.css">
<link rel="stylesheet" href="https://ithelp.ithome.com.tw/css/google.css">
<link rel="stylesheet" href="https://ithelp.ithome.com.tw/css/style.css?202008271142">
<!-- highlight -->
<link rel="stylesheet" href="https://ithelp.ithome.com.tw/css/railscasts.css">
<!-- end -->
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
<!--messenger css-->
</head>
<body>
<div class="header">
<div class="header__inner clearfix">
<h1 class="header__logo pull-left"><a href="/"><img src="https://ithelp.ithome.com.tw/storage/image/logo.svg" alt="iT邦幫忙" class="img-responsive"></a></h1>
<div class="header__promote">
<div class="a12word pull-right">
<div class="a12word__box">
<script type="text/javascript" src="https://itadapi.ithome.com.tw/media/serve?type=T2&channel=ithome_forum&encoding=Utf8"> </script>
</div>
<div class="a12word__box">
<script type="text/javascript" src="https://itadapi.ithome.com.tw/media/serve?type=T3&channel=ithome_forum&encoding=Utf8"> </script>
</div>
<div class="a12word__box">
<script type="text/javascript" src="https://itadapi.ithome.com.tw/media/serve?type=T4&channel=ithome_forum&encoding=Utf8"> </script>
</div>
</div>
<div class="a970 pull-right">
<script src="https://itadapi.ithome.com.tw/media/serve?type=B1&channel=ithome_forum&encoding=Utf8"></script>
</div>
</div>
</div>
.............
</body>
</html>
If we only want to filter out and crawl the title of each Ironman article, we can see that the article titles always sit under
<body>
→ <div class="board leftside profile-main">
→ <div class="ir-profile-content">
→ <div class="profile-list__content">
and inside each <div class="profile-list__content"> we can find <h3 class="qa-list__title">
→ <a class="qa-list__title-link">
title </a>
...
<body>
...
<div class="board leftside profile-main">
<div class="ir-profile-content">
...
<div class="profile-list__content">
...
<h3 class="qa-list__title">
<a href="https://ithelp.ithome.com.tw/articles/10267570" class="qa-list__title-link">
Day4 Variable
</a>
</h3>
...
</div>
...
...
So we use XPath to parse the page and extract the titles we want.
app/crawler/collier.go
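// Requires the "strings" and "github.com/antchfx/htmlquery" imports.
c.OnResponse(func(r *colly.Response) {
	doc, err := htmlquery.Parse(strings.NewReader(string(r.Body)))
	if err != nil {
		// Log and skip this page; Fatal would terminate the whole server.
		middleware.Logger().WithFields(logrus.Fields{
			"name": "Collier",
		}).Error("Visited failed, error: ", err)
		return
	}
	// Select every article block, then look for the title inside each one.
	nodes := htmlquery.Find(doc, `//div[@class="board leftside profile-main"]//div[@class="ir-profile-content"]//div[@class="profile-list__content"]`)
	for _, node := range nodes {
		// The leading "." keeps the query relative to the current block.
		title := htmlquery.FindOne(node, `.//h3[@class="qa-list__title"]//a[@class="qa-list__title-link"]/text()`)
		if title == nil {
			continue // no title link inside this block
		}
		middleware.Logger().WithFields(logrus.Fields{
			"name": "Collier",
		}).Info("Title: ", htmlquery.InnerText(title))
	}
})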
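We first select every div[@class="profile-list__content"] node, then find the title's text node inside each one and use htmlquery.InnerText() to convert it to a string and write it to the log.
The data written to the log looks like this: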
time="111110-10-10 1010:1010:1010" level=info msg="Visiting: https://ithelp.ithome.com.tw/users/20129737/ironman/4014" name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n Day1 Why Go?\n " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n Day2 Develop Environment For Go\n " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n Day3 First Go application\n " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n Day4 Variable\n " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n Day5 Type\n " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n Day6 Array and Slice\n " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n Day7 Map and Struct\n " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n Day8 Function and Interface\n " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n Day9 Goroutine\n " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n Day10 Sync.WaitGroup & Sync.Map\n " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Finished: https://ithelp.ithome.com.tw/users/20129737/ironman/4014" name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="| 200 | 692.5716ms | 192.168.16.1 | GET | /crawler |"
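The titles in the log above still carry the newlines and indentation of the HTML source. Below is a minimal sketch of normalizing such a string before logging it, using only the standard library. cleanTitle is a hypothetical helper name and is not part of the original code.

```go
package main

import (
	"fmt"
	"strings"
)

// cleanTitle is a hypothetical helper: it drops leading and trailing
// whitespace and collapses any inner run of whitespace into one space.
func cleanTitle(raw string) string {
	return strings.Join(strings.Fields(raw), " ")
}

func main() {
	raw := "\n                        Day4 Variable\n                    "
	fmt.Printf("%q\n", cleanTitle(raw)) // prints: "Day4 Variable"
}
```

In the crawler, the same call could wrap htmlquery.InnerText(title) before the value is passed to the logger.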
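In this chapter we implemented crawling an Ironman page with Colly and printing all of its titles. From now on, when we need to crawl a specific domain or data, we do not have to limit ourselves to Python; Go is a good choice too!
The code for this chapter is also available at the link below for reference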
https://github.com/Neskem/Ironman-2021/tree/Day-27